The Group Square-Root Lasso: 
Theoretical Properties and Fast 
Algorithms 

Florentina Bunea, Johannes Lederer, and Yiyuan She 

Cornell University
e-mail: fb238@cornell.edu

ETH Zürich
e-mail: lederer@stat.math.ethz.ch

Florida State University
e-mail: yshe@stat.fsu.edu

Abstract: We introduce and study the Group Square-Root Lasso (GSRL)
method for estimation in high dimensional sparse regression models with 
group structure. The new estimator minimizes the square root of the resid- 
ual sum of squares plus a penalty term proportional to the sum of the 
Euclidean norms of groups of the regression parameter vector. The net ad- 
vantage of the method over the existing Group Lasso (GL)-type procedures 
consists in the form of the proportionality factor used in the penalty term, 
which for GSRL is independent of the variance of the error terms. This is of 
crucial importance in models with more parameters than the sample size, 
when estimating the variance of the noise becomes as difficult as the original 
problem. We show that the GSRL estimator adapts to the unknown sparsity 
of the regression vector, and has the same optimal estimation and predic- 
tion accuracy as the GL estimators, under the same minimal conditions 
on the model. This extends the results recently established for the Square- 
Root Lasso, for sparse regression without group structure. Moreover, as a 
new type of result for Square-Root Lasso methods, with or without groups,
we study correct pattern recovery, and show that it can be achieved under
conditions similar to those needed by the Lasso or Group-Lasso-type methods,
but with a simplified tuning strategy. We implement our method via a
new algorithm, with proved convergence properties, which, unlike existing
methods, scales well with the dimension of the problem. Our simulation
studies strongly support our theoretical findings.

1. Introduction

Variable selection in high dimensional linear regression models has become a 
very active area of research in the last decade. In linear models one observes 
independent response random variables $Y_i \in \mathbb{R}$, $1 \le i \le n$, and assumes that
each $Y_i$ can be written as a linear function of the i-th observation on a
p-dimensional predictor vector $X_i := (X_{i1}, \dots, X_{ip})$, corrupted by noise:

$Y_i = X_i\beta^0 + \sigma\epsilon_i$,    (1)

where $\beta^0 \in \mathbb{R}^p$ is the unknown regression vector, $\sigma > 0$ is the noise level, and for
each $1 \le i \le n$, the additive term $\epsilon_i$ is a mean zero random noise component.
Postulating that some components of $\beta^0$ are zero is equivalent to assuming that
the corresponding predictors are unrelated to the response. The problem of
predictor selection can therefore be solved by devising methods that estimate
accurately where the zeros occur.

More recently, a high volume of literature has been devoted to the selection 
of groups of predictors. This problem requires methods that set to zero entire 
groups of coefficients and is the focus of this work. Group selection arises natu- 
rally whenever it is plausible to assume, based on scientific considerations, that 
entire subsets of the X-variables are unrelated to the response. More generally,
the need for setting groups of coefficients to zero is a building block in variable 
selection in general additive models and sparse kernel learning, as discussed in 
Meier et al. [13] and Koltchinskii and Yuan [9], among others. Another direct ap- 
plication is to predictor selection in the multivariate response regression model 

Z = UA + E, (2) 

where Z is an n × m matrix in which each row contains measurements on an
m-dimensional random response vector, U is an n × p observed matrix whose
rows are the n measurements of a p-dimensional predictor, E is the zero mean
noise matrix, and A is the unknown coefficient matrix. A predictor $U_j$ is not
present in this model if the j-th row of A is equal to zero. Using the vectorization
operator vec, (2) can be written as $\mathrm{vec}(Z') = (U \otimes I)\,\mathrm{vec}(A') + \mathrm{vec}(E')$. Thus, if
one treats rows of A as groups, predictor selection in model (2) can be regarded 
as group selection in linear models of type (1). 
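
The vectorization step can be checked numerically. The following sketch (NumPy, with illustrative sizes chosen here and not taken from the paper) builds model (2), verifies the identity vec(Z') = (U ⊗ I) vec(A') + vec(E'), and lists the coordinate groups that correspond to the rows of A. The helper name `vec` is an assumption made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, m = 5, 3, 2                    # illustrative sizes (assumed)
U = rng.normal(size=(n, p))
A = rng.normal(size=(p, m))
E = rng.normal(size=(n, m))
Z = U @ A + E                        # multivariate response model (2)

def vec(M):
    """Stack the columns of M into one long vector."""
    return M.reshape(-1, order="F")

lhs = vec(Z.T)
design = np.kron(U, np.eye(m))       # U ⊗ I_m, shape (n*m, p*m)
rhs = design @ vec(A.T) + vec(E.T)
identity_holds = np.allclose(lhs, rhs)

# Row j of A occupies coordinates j*m, ..., j*m + m - 1 of vec(A'),
# so the q = p groups all have size m.
groups = [list(range(j * m, (j + 1) * m)) for j in range(p)]
```

Setting a whole group of coordinates of vec(A') to zero is then the same as removing the predictor $U_j$ from model (2).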

Perhaps the most popular method for group selection is the Group-Lasso, 
introduced by Yuan and Lin [27] and further studied theoretically in a number 
of works, including Lounici et al. [12] and Wei and Huang [26]. The method consists
in minimizing the empirical square loss plus a term proportional to the sum of
the Euclidean norms of groups of coefficients. Specifically, let $Y = (Y_1, \dots, Y_n)'$.
We denote by $X \in \mathbb{R}^{n \times p}$ the matrix with rows $X_i$, $1 \le i \le n$, and refer
to it in the sequel as the design matrix. We assign the individual columns of
the design matrix and the corresponding entries of the regression vector to
groups. For this, we consider a partition $\{G_1, \dots, G_q\}$ of $\{1, \dots, p\}$ into groups
and denote the cardinality of a group $G_j$ by $T_j$ and the minimal group size
by $T_{\min} := \min_{1 \le j \le q} T_j$. We then assign all columns of the design matrix X
with indices in $G_j$ to the group $G_j$. The corresponding matrix is denoted by
$X^j \in \mathbb{R}^{n \times T_j}$. Similarly, for any vector $\beta \in \mathbb{R}^p$, we assign all components of
$\beta$ with indices in $G_j$ to the group $G_j$ and denote the corresponding vector by
$\beta^j \in \mathbb{R}^{T_j}$. We define the active set as

$S := \{1 \le j \le q : (\beta^0)^j \ne 0\}$.    (3)
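
As a concrete illustration of this notation, the sketch below encodes a partition as a list of index lists and computes the group sizes, the minimal group size and the active set for a hypothetical coefficient vector. The helper names and the 0-based indexing are assumptions made for the example.

```python
import numpy as np

def group_sizes(groups):
    """Cardinalities T_j of the groups G_j."""
    return [len(G) for G in groups]

def active_set(beta, groups):
    """Indices j (0-based) whose coefficient block beta^j is not identically zero."""
    return [j for j, G in enumerate(groups) if np.any(beta[G] != 0)]

# Hypothetical example: p = 7 coordinates split into q = 3 groups of unequal size.
groups = [[0, 1], [2, 3, 4], [5, 6]]
beta0 = np.array([0.0, 0.0, 1.5, 0.0, -2.0, 0.0, 0.0])

T = group_sizes(groups)
T_min = min(T)
S = active_set(beta0, groups)
```

Only the second group carries nonzero entries here, so the active set contains a single index even though two individual coefficients are nonzero.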
We will denote by $\|v\|_2$ the Euclidean norm of a generic vector v. Let $\lambda > 0$
be a given tuning sequence. With this notation, the Group Lasso estimator is
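given by

$\arg\min_{\beta \in \mathbb{R}^p} \left\{ \frac{\|Y - X\beta\|_2^2}{n} + \frac{\lambda}{n} \sum_{j=1}^q \sqrt{T_j}\,\|\beta^j\|_2 \right\}$.
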
Optimal estimation of $\beta^0$, $X\beta^0$ and S via the Group-Lasso is very well understood,
and we refer to Bühlmann and van de Geer [4] for an overview. However,
one outstanding problem remains, and it is connected to the practical choice of
$\lambda$ that leads, respectively, to optimal estimation with respect to each of these
three aspects. It is agreed upon that whereas choosing $\lambda$ via cross-validation will
yield estimates with good prediction and estimation accuracy, this choice is not
optimal for correct estimation of S. A possibility is to determine first the theoretical
forms of the tuning parameter that yield optimal performances, respectively,
and then estimate the unknown quantities in these theoretical expressions. One
important reason for which this approach has not become popular is the fact
that the respective optimal values of $\lambda$ depend on $\sigma$, the noise level, and the
accurate estimation of $\sigma$ when p > n may be as difficult as the original problem
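of selection. A step forward has been made by Belloni et al. [1], in the context of
variable (not group) selection. They introduced the Square-Root Lasso (SRL)
given below

$\arg\min_{\beta \in \mathbb{R}^p} \left\{ \frac{\|Y - X\beta\|_2}{\sqrt{n}} + \frac{\lambda}{n} \|\beta\|_1 \right\}$.
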
The consideration of the square-root form of the criterion was first proposed by
Owen [16] in the statistics literature, and a similar approach is the Scaled Lasso
by Sun and Zhang [20]. Belloni et al. [1] studied theoretically the estimation and
prediction accuracy of the SRL estimator, and showed that it is similar to
that of the Lasso, with the net advantage that optimality can be achieved for
a tuning sequence independent of $\sigma$. This makes this version of the Lasso-type
procedure much more appealing when p is large, especially when p > n, and
opens the question whether the same holds true for pattern recovery, which
was not studied in [1]. Moreover, given the wide applicability of group selection
methods, it motivates the study of a grouped version of the Square-Root Lasso.
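We therefore introduce and study the Group Square-Root Lasso (GSRL)

$\hat\beta := \arg\min_{\beta \in \mathbb{R}^p} \left\{ \frac{\|Y - X\beta\|_2}{\sqrt{n}} + \frac{\lambda}{n} \sum_{j=1}^q \sqrt{T_j}\,\|\beta^j\|_2 \right\}$.    (4)
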
Our contributions are: 

(a) To extend the ideas behind the Square-Root Lasso for group selection and
develop a new method, the Group Square-Root Lasso (GSRL).

(b) To show that the GSRL estimator has optimal estimation and prediction,
achievable with a $\sigma$-free tuning sequence $\lambda$. This generalizes the results for SRL
obtained by [1].

(c) To show that GSRL leads to correct pattern recovery, with a $\sigma$-free tuning
sequence $\lambda$. This provides, in particular, a positive answer to the question left
open in [1].

(d) To propose algorithms with guaranteed convergence properties that scale
well with the size of the problem, measured by p, thereby extending the scope
of the existing procedures, which perform well mainly for small and moderate
values of p.

We address (a), (b) and (c) in Section 2 below, and (d) in Section 3. Section
4 contains simulation results that strongly support our findings. The proofs of
all our results are collected in the Appendix.

2. Theoretical Properties of the Group Square-Root Lasso 

In this section, we show that: (i) Nothing is lost by using $\|Y - X\beta\|_2$ instead
of $\|Y - X\beta\|_2^2$ in the definition of our estimator $\hat\beta$ given by (4). Specifically, the
Group Square-Root Lasso has the same accuracy as the Group Lasso, under
essentially the same conditions, in terms of estimation, prediction and subset
recovery. (ii) The net gain is that these properties are achieved via a tuning
parameter $\lambda$ that is $\sigma$-free, in contrast with the Group-Lasso, which requires a
tuning parameter $\lambda$ that is a function of $\sigma$.
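
One heuristic way to see the difference between the two criteria is scale equivariance. The sketch below assumes the GSRL criterion has the form $\|Y - X\beta\|_2/\sqrt{n} + (\lambda/n)\sum_j \sqrt{T_j}\|\beta^j\|_2$ and the Group Lasso criterion replaces the first term by $\|Y - X\beta\|_2^2/n$; the function names and the data are hypothetical. It only illustrates how the two objectives react to a change of scale and is not the argument used in the proofs.

```python
import numpy as np

def group_penalty(beta, groups):
    """Sum over groups of sqrt(T_j) * ||beta^j||_2."""
    return sum(np.sqrt(len(G)) * np.linalg.norm(beta[G]) for G in groups)

def gsrl_objective(beta, X, Y, groups, lam):
    """Square-root loss plus group penalty (assumed scaling: 1/sqrt(n) and lam/n)."""
    n = len(Y)
    return np.linalg.norm(Y - X @ beta) / np.sqrt(n) + lam / n * group_penalty(beta, groups)

def gl_objective(beta, X, Y, groups, lam):
    """Squared loss plus the same group penalty (assumed scaling: 1/n and lam/n)."""
    n = len(Y)
    return np.linalg.norm(Y - X @ beta) ** 2 / n + lam / n * group_penalty(beta, groups)

rng = np.random.default_rng(0)
n, groups = 20, [[0, 1], [2, 3], [4, 5]]
X = rng.normal(size=(n, 6))
Y = rng.normal(size=n)
beta = rng.normal(size=6)
lam, c = 3.0, 10.0                   # c plays the role of a change of scale

# Rescaling (Y, beta) by c rescales the GSRL objective by c for the SAME lam.
gsrl_same_lam = np.isclose(gsrl_objective(c * beta, X, c * Y, groups, lam),
                           c * gsrl_objective(beta, X, Y, groups, lam))

# For the Group Lasso the analogous identity requires lam to be replaced by c * lam.
gl_same_lam = np.isclose(gl_objective(c * beta, X, c * Y, groups, lam),
                         c ** 2 * gl_objective(beta, X, Y, groups, lam))
gl_scaled_lam = np.isclose(gl_objective(c * beta, X, c * Y, groups, c * lam),
                           c ** 2 * gl_objective(beta, X, Y, groups, lam))
```

Under these assumptions the square-root criterion is positively homogeneous in (Y, β), so a fixed $\lambda$ behaves consistently across scales, whereas the squared-loss criterion needs $\lambda$ to move with the scale.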

The following notation and conventions will be used throughout the paper.
We assume that the design matrix is nonrandom and normalized such that the
diagonal entries of the Gram matrix $\Sigma := X'X/n$ are equal to 1. We denote the
cardinality of the set S defined in (3) above by s, that is $|S| = s$, and refer to
s as the sparsity index. We set $s^* := \sum_{j \in S} T_j$. We denote by $\beta_S \in \mathbb{R}^{s^*}$ (and
similarly $\beta_{S^c} \in \mathbb{R}^{p - s^*}$) the vector that consists of the entries of $\beta \in \mathbb{R}^p$ with
indices in $\bigcup_{j \in S} G_j$ (respectively $\bigcup_{j \in S^c} G_j$). Corresponding notation is used for matrices.

For a generic vector v we denote by $\|v\|_\infty$ its supremum norm, the maximum
absolute value of its coordinates.



2.1. Estimation and Prediction 

We begin with the study of the estimation and prediction accuracy of the Group 
Square-Root Lasso. We first state and discuss the conditions under which these 
results will be established. 

As shown in Theorem 2.1 below, our results hold under the general Com- 
patibility Condition on the design matrix, introduced for the Lasso in [21], and 
extended to this setting in ([4, Page 255]). This condition is a slight relaxation 
of the widely used Cone or Restricted Eigenvalues Condition (see [2]). We refer 
to [22] and ([4, Chapter 6.13]) for a detailed comparison between these two and 
other related conditions. This comparison reveals the fact that the CC condition 
stated below is the weakest requirement on the design under which Lasso-type 
procedures can be guaranteed to perform well. 

Compatibility Condition (CC): We say that the Compatibility Condition
is met for $\kappa > 0$ and $\gamma > 1$ if



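for all $\delta \in \Delta_\gamma$, where

$\Delta_\gamma := \left\{ \delta \in \mathbb{R}^p : \sum_{j \in S^c} \sqrt{T_j}\,\|\delta^j\|_2 \le \gamma \sum_{j \in S} \sqrt{T_j}\,\|\delta^j\|_2 \right\}$.    (6)

We refer to $\kappa$ and $\gamma$ as the compatibility constants and write

$(\kappa, \gamma) \in C(X, S)$.

The compatibility constant $\kappa$ measures the correlations in the design matrix:
the smaller the value of $\kappa$, the larger the correlations.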

For clarity of exposition, we will assume for the rest of the paper that the 
additive error terms have a standard Gaussian distribution. 

The second ingredient in our analysis is the definition of the appropriate noise
component that needs to be compensated for by the tuning parameter $\lambda$. The
proofs of our results reveal that it is

$V := \max_{1 \le j \le q} \left\{ \frac{\sqrt{n}\,\|(X'\epsilon)^j\|_2}{\sqrt{T_j}\,\|\epsilon\|_2} \right\}$.    (7)

For $\gamma > 1$ given by condition CC above, let $\bar\gamma := \frac{\gamma + 1}{\gamma - 1}$. For given $\lambda > 0$ define
the set

$\mathcal{A} := \{V \le \lambda/\bar\gamma\}$.    (8)

We first establish our result over the set $\mathcal{A}$. We then show, in Lemma 2.1
below, that the set $\mathcal{A}$ has probability at least $1 - \alpha$, for any $\alpha$ close to zero, for an
appropriate choice of the tuning parameter $\lambda$. Since $\lambda$ will be chosen relative to the
ratio of the random variables that define V, the factor $\sigma$ cancels out. This is the
key for obtaining a tuning parameter $\lambda$ independent of the variance of the noise.
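
The cancellation of $\sigma$ can be checked directly. The sketch below assumes that V is the maximum over groups of $\sqrt{n}\,\|(X'\epsilon)^j\|_2 / (\sqrt{T_j}\,\|\epsilon\|_2)$, with columns of X normalized as in the conventions above; the function name and the simulated data are hypothetical.

```python
import numpy as np

def noise_statistic(X, eps, groups):
    """max_j sqrt(n) * ||(X^j)' eps||_2 / (sqrt(T_j) * ||eps||_2)  (assumed form of V)."""
    n = X.shape[0]
    score = X.T @ eps
    return max(
        np.sqrt(n) * np.linalg.norm(score[G]) / (np.sqrt(len(G)) * np.linalg.norm(eps))
        for G in groups
    )

rng = np.random.default_rng(1)
n, groups = 50, [[0, 1, 2], [3, 4, 5], [6, 7, 8]]
X = rng.normal(size=(n, 9))
X *= np.sqrt(n) / np.linalg.norm(X, axis=0)   # columns scaled so diag(X'X/n) = 1
eps = rng.normal(size=n)

V_unit = noise_statistic(X, eps, groups)            # noise level 1
V_scaled = noise_statistic(X, 7.3 * eps, groups)    # noise level 7.3
sigma_free = np.isclose(V_unit, V_scaled)
```

Because the noise vector enters both the numerator and the denominator, multiplying it by any positive constant leaves the statistic unchanged, so a threshold for V can be set without knowing the noise level.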

With 7 > 1 given by CC above, let c = 1 V (1 — 7)^. With k > given by 
CC we assume in what follows that the sparsity index s* is not larger than the 
sample size n. Specifically, we assume that 

' < 

We will show in Lemma 2.1 below that the value of $\lambda$ for which the event $\mathcal{A}$
has high probability is, in terms of orders of magnitude, no larger than $\lambda =
O(\sqrt{n \log q})$. Therefore, and using the notation $\lesssim$ for inequalities that hold up
to multiplicative constants, the condition on the sparsity index becomes

$s^* \lesssim \frac{n}{\log q}$,

which re-emphasizes the introduction of $s^*$ in this analysis to start with: whereas
we allow p > n, we cannot expect good performance of any method from a
limited sample size n, unless the true model has essentially fewer parameters
than n.

The following result summarizes the prediction and estimation properties of
the Group Square-Root Lasso estimator. It generalizes [1, Theorem 1], where
the Square-Root Lasso is treated, corresponding in our set-up to the special case
q = p.

Theorem 2.1. Assume that $(\kappa, \gamma) \in C(X, S)$ and that (9) holds. Then, on the
event $\mathcal{A}$, the following hold with probability $1 - \alpha$, for any given $\alpha \in (0, 1)$:



The precise constants in the statements above are given in the proof of this
theorem, presented in the appendix. Theorem 2.1 is the crucial step in showing
that the GSRL estimator, which has a tuning parameter free of $\sigma$, has the same
optimal rates of convergence as the Group Lasso estimator, see for instance
Lounici et al. [12] or Bühlmann and van de Geer [4]. We will determine the size
of $\lambda$ in Lemma 2.1 below and state the resulting rates in Corollary 2.1.



2.2. Correct subset recovery 

We study below the subset recovery properties of the Group Square-Root Lasso. 
Similarly to the analysis of all other Lasso-type procedures, subset recovery can 
only be guaranteed under additional assumptions on the model. 

The first condition is the Group Irrepresentable Condition, which is an additional
condition on the design matrix X. To introduce it, we decompose



the Gram matrix E with := := ^2,1 := and 

S2,2 := We define £2,1 := (0 Si,2)' and T.-^^ := (0 ^-^^Y . 

Group Irrepresentable Condition (GIR): We say that the Group Irrepresentable
Condition is met for $0 < \eta < 1$ if $\Sigma_{1,1}$ is invertible and




and 




max 



max 



(10) 





We refer to $\eta$ as the group irrepresentable constant and write $\eta \in I(X, S)$.

The Group Irrepresentable Condition implies the Compatibility Condition
discussed above, see for instance [4], and it is therefore more restrictive. However,
it is essentially a necessary and sufficient condition for consistent support
recovery via Lasso-type procedures, see [28]. We refer to [4, 14, 28, 29] for different
versions of the Irrepresentable Condition and further discussion of these
versions.

The second condition needed for precise support recovery regards the strength
of the signal $\beta^0$. Because the noise can conceal small components of the regression
vector $\beta^0$, some of its nonzero components need to be sufficiently large to
be detectable. We formulate this in the Beta Min Condition, similarly to [5] and
[17, 23]:

Beta Min Condition (BM): We say that the Beta Min Condition is met
for $m \in \mathbb{R}^q$ if

$\|(\beta^0)^j\|_\infty \ge m_j$    (11)

for all $j \in S$. We then write $m \in B(\beta^0)$.

Note that only one component of $\beta^0$ in each non-zero group has to be sufficiently
large, because we aim to select whole groups, and not individual components.
large, because we aim to select whole groups, and not individual components. 

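A slightly different tuning parameter, still independent of $\sigma$, is needed for consistent
subset recovery. Let $\bar\eta$ be defined from the constant $\eta$ given by GIR above, and recall that
$\bar\gamma = \frac{\gamma + 1}{\gamma - 1}$, with $\gamma$ defined in condition CC above. Define the event

$\mathcal{A}_1 := \{V \le \lambda/(\bar\gamma \vee 2\bar\eta)\}$.    (12)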
Finally, we introduce the following notation

$\xi_{\|\cdot\|} := \max_{v : \|v^j\|_2 \le \sqrt{T_j}} \; \max_{1 \le j \le q} \frac{\|(\Sigma_{1,1}^{-1} v)^j\|_\infty}{\sqrt{T_j}}$.

For orthonormal design matrices, $\xi_{\|\cdot\|} = 1$, otherwise $\xi_{\|\cdot\|} > 0$ is a constant
independent of n, p or q. Let $\alpha \in (0, 1)$ be given.

Theorem 2.2. Assume that the conditions CC, GIR and BM are met, and
that (9) holds. Assume that $(\kappa, \gamma) \in C(X, S)$ and $\eta \in I(X, S)$. Let B > 0 be
a dominating constant. Then, on the set $\mathcal{A}_1$, we have, with probability greater
than $1 - \alpha$:

(1) $\hat\beta_{S^c} = 0$;

(2) For all $1 \le j \le q$,

\\(0-0°V\\^<B^^. 

(3) If there exists an $m \in B(\beta^0)$ such that $m_j >$ , for each $j \in S$, then



Remark 2.1. The constant B depends on $\gamma$, $\eta$, $\kappa$ and $\xi_{\|\cdot\|}$, but not on n, p, q.
Its exact form is given in the proof of Theorem 2.2. The results above show that
the Group Square-Root Lasso will recover the sparsity pattern consistently, as
long as $\mathcal{A}_1$ has high probability, which we show in Lemma 2.1 below. Theorem
2.2 holds under slightly more general conditions on the design than the variant
on the mutual coherence condition employed in Lounici et al. [12], for pattern
recovery with the Group Lasso. Moreover, the recovery is guaranteed for signals
of minimal strength, just above noise level, which we quantify precisely in
Corollary 2.1 below.

Remark 2.2. Theorem 2.2 can be proved under GIR and BM only, as GIR
implies CC. However, using only GIR would require the derivation of the corresponding
constants under which CC holds, as we will appeal to the conclusion
of Theorem 2.1 in the course of the proof of Theorem 2.2. Given that the arguments
are already technical, we opted for stating both assumptions separately,
for transparency.

Remark 2.3. The Group Square-Root Lasso can be shown to lead to correct
subset recovery under sharper Beta Min Conditions, for a constant B indepen- 
dent of if we impose stricter conditions on the design. For example, one 
can invoke the Group Mutual Coherence Condition (GMC) and apply ideas de- 
veloped in [5] to find the condition rnj > yjTjXjn^ which is of the same order 
as above, but holds up to universal constants, independent of the conditions on 
the design. We do not detail this approach here, since the GMC implies GIR, 
and the proof would follow very closely the ideas in [5] . 

2.3. Choice of the Tuning Parameter 

As discussed above, the novel property of the Group Square-Root Lasso method
is that its tuning parameter $\lambda$ can be chosen independently of the noise level
$\sigma$. This is particularly interesting in the high-dimensional setting $p \gg n$, where
good estimates of $\sigma$ are not usually available. In determining $\lambda$ for this method,
we recall that it has to be sufficiently large to overrule the noise component,
which is independent of $\sigma$, in that the events $\mathcal{A}$ and $\mathcal{A}_1$, given above by (8)
and (12), respectively, hold with high probability. At the same time, the bounds
in Theorems 2.1 and 2.2 become sharper for smaller values of $\lambda$. To incorporate
these two constraints, we choose the tuning parameter as the smallest value that
overrules the noise part with high probability. For this, we fix $\alpha \in (0, 1)$ and
choose the smallest value $\lambda$ such that with probability at least $1 - \alpha$ it still holds
that $\lambda/\bar\gamma \ge V$ or $\lambda/(\bar\gamma \vee 2\bar\eta) \ge V$, depending on the type of results we are
interested in. Standard values for $\alpha$ are 0.05 and 0.01.

For each j, let $\zeta_j := \|(X^j)'X^j\|/n$ and $\zeta := \max_j \zeta_j$, where $\|A\|$ is the operator norm
of a generic matrix A.

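Lemma 2.1. Assume that the error terms $\epsilon_i$, $1 \le i \le n$, are i.i.d. standard
Gaussian random variables, and assume that $T_j < n$, for all $1 \le j \le q$. Let
$\alpha \in (0, 1)$ be given such that $16 \log(2q/\alpha) \le n - T_{\max}$. Then, if

$\lambda_0 = \frac{\sqrt{2\zeta}\, n}{\sqrt{n - T_{\max}}} \left( 1 + \sqrt{\frac{2 \log(2q/\alpha)}{T_{\min}}} \right)$,

it holds that

$P(V \ge \lambda_0) \le \alpha$.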



As an immediate consequence, the following corollary summarizes the expressions
of $\lambda$ for which the events $\mathcal{A}$ and $\mathcal{A}_1$ hold with probability at least $1 - \alpha$, for each
given $\alpha$. Notice that $\lambda$ is independent of $\sigma$, as claimed. Corollary 2.1 also shows
that the sharp rates of convergence and subset recovery properties of the Group
Lasso are also enjoyed by the Group Square-Root Lasso, with the important
added benefit that the new method's tuning parameter is $\sigma$-free.

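Corollary 2.1. Assume that the error terms $\epsilon_i$, $1 \le i \le n$, are i.i.d. standard
Gaussian random variables and assume that $T_j < n$, for all $1 \le j \le q$. Let
$\alpha \in (0, 1)$ be given such that $16 \log(2q/\alpha) \le n - T_{\max}$.

(i) If

$\lambda \ge \frac{\sqrt{2\zeta}\, n\, \bar\gamma}{\sqrt{n - T_{\max}}} \left( 1 + \sqrt{\frac{2 \log(2q/\alpha)}{T_{\min}}} \right)$,

then $P(\mathcal{A}) \ge 1 - \alpha$.

(ii) If

$\lambda \ge \frac{\sqrt{2\zeta}\, n\, (\bar\gamma \vee 2\bar\eta)}{\sqrt{n - T_{\max}}} \left( 1 + \sqrt{\frac{2 \log(2q/\alpha)}{T_{\min}}} \right)$,

then $P(\mathcal{A}_1) \ge 1 - \alpha$.

(iii) Under the assumptions of Theorem 2.1, its conclusion holds with probability
at least $1 - 2\alpha$ and $\lambda = O(\sqrt{n \log q})$.

(iv) Under the assumptions of Theorem 2.2, its conclusion holds with probability
at least $1 - 2\alpha$ and $\lambda = O(\sqrt{n \log q})$.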

The first two claims follow immediately from Lemma 2.1 and the definitions
of $\mathcal{A}$ and $\mathcal{A}_1$, respectively. The third and fourth claims follow directly from the
first two, by invoking Theorems 2.1 and 2.2, respectively. We only considered
Gaussian errors above for clarity of exposition. However, more general results
can be established applying different deviation inequalities, for instance [3, 11,
24]. For example, if the $\epsilon_i$'s belong to a general sub-exponential family, the
order of magnitude of $\lambda$ remains the same. Additionally, correlations between
the groups are expected to lead to similar results as for Lasso, see [8, 25], but
this is of minor interest for this contribution.



3. Computational Algorithm 

In this section we show that the Group Square-Root Lasso can be implemented
very efficiently. We consider estimators of a form slightly more general than (4):

$\hat\beta := \arg\min_{\beta \in \mathbb{R}^p} \left\{ \|Y - X\beta\|_2 + \sum_{j=1}^q \lambda_j \|\beta^j\|_2 \right\}$,    (13)

where $\lambda_1, \dots, \lambda_q > 0$ are arbitrary given constants. For convenience, we will
implement, without loss of generality, the following variant

$\hat\beta := \arg\min_{\beta \in \mathbb{R}^p} \left\{ \frac{\|Y - X\beta\|_2}{K} + \sum_{j=1}^q \lambda_j \|\beta^j\|_2 \right\}$,    (14)

where K is a fixed, sufficiently large constant. A global minimum of (14), for
given constants $\lambda_1, \dots, \lambda_q$, is also a global minimum of (13) with constants
$K\lambda_1, \dots, K\lambda_q$.

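When q = p, (14) reduces to the Square-Root Lasso, which was formulated
in the form [1]:

$\min_{t, v, \beta^+, \beta^-} \; \frac{t}{\sqrt{n}} + \frac{\lambda}{n} \sum_{j=1}^p (\beta_j^+ + \beta_j^-)$

s.t. $v_i = y_i - x_i'\beta^+ + x_i'\beta^-$, $1 \le i \le n$, $t \ge \|v\|_2$, $\beta^+ \ge 0$, $\beta^- \ge 0$.    (15)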

The last three constraints are second-order cone constraints. Based on this conic
formulation, Belloni et al. [1] have derived three computational algorithms for
solving the Square-Root Lasso:

1. First order methods by calling the TFOCS Matlab package, or TFOCS
for short;

2. Interior point method by calling the SDPT3 Matlab package, or IPM for
short;

3. Coordinatewise optimization, or COORD for short.

According to our experience, TFOCS is very slow and inaccurate. COORD is
reasonably fast, but not as accurate as IPM, especially in applications with a
large number of parameters p. In computing a solution path, COORD is still
much slower than the, perhaps most popular, coordinate descent algorithm for
solving the Lasso [7]. Therefore, even for the Square-Root Lasso, without groups,
a fast and accurate algorithm is still needed.

We propose a scaled thresholding-based iterative selection procedure (S-TISP)
for solving the general Group Square-Root Lasso problem (14). Assume
the scaling step

$Y \leftarrow Y/K$, $X \leftarrow X/K$    (16)

has been performed. Starting from an arbitrary $\beta(0) \in \mathbb{R}^p$, S-TISP performs
the following iterations to update $\beta(t)$, t = 0, 1, ...:

$\beta^j(t+1) = \vec\Theta\left( \beta^j(t) + (X^j)'(Y - X\beta(t));\; \lambda_j \|X\beta(t) - Y\|_2 \right)$, $1 \le j \le q$.    (17)

Here, $\vec\Theta$ is the multivariate soft-thresholding operator [19] defined through $\vec\Theta(0; \lambda) := 0$
and $\vec\Theta(a; \lambda) := a\,\Theta(\|a\|_2; \lambda)/\|a\|_2$ for $a \ne 0$, where $\Theta(t; \lambda) := \mathrm{sign}(t)(|t| - \lambda)_+$
is the soft-thresholding rule. S-TISP is extremely simple to implement and does
not resort to any optimization packages.

The following theorem guarantees the global convergence of $\beta(t)$. The result
is considerably stronger than the 'every accumulation point'-type arguments
that are often seen in numerical analysis.

Theorem 3.1. Suppose the following regularity condition holds: $\inf_{\xi \in \mathcal{A}} \|X\xi - Y\|_2 > 0$,
where $\mathcal{A} = \{\vartheta\beta(t) + (1 - \vartheta)\beta(t+1) : \vartheta \in [0, 1],\ t = 0, 1, \dots\}$. Then,
for K large enough, the sequence of iterates $\beta(t)$ generated by (17) starting with
any $\beta(0)$ converges to a global minimum of (14).

According to our experience, smaller values of K lead to faster convergence
if the algorithm converges. The choice $K = \|X\|/\sqrt{2}$, motivated by display (39)
in the proof of Theorem 3.1, works well in the simulation studies; we recall that
$\|X\|$ is the operator norm of the matrix X. The associated objective function is
$\|Y - X\beta\|_2 + \sum_j K \lambda_j \|\beta^j\|_2$, which reduces to the specific form (4) if we set

$\lambda_j = \frac{\lambda}{K} \sqrt{\frac{T_j}{n}}$.

Other choices of $\lambda_j$ are allowable in our computational algorithm. We suggest
using warm starts so that the convexity of the problem can be well exploited.
Concretely, after specifying a decreasing grid for $\lambda$, denoted by $\Lambda = \{\lambda_1, \dots, \lambda_l\}$,
we use the converged solution $\hat\beta_{\lambda_k}$ as the initial point $\beta(0)$ in (17) for the new
optimization problem associated with $\lambda_{k+1}$.
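
The iteration (17) and the warm-start strategy can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' Matlab implementation: the function names, the stopping rule, and the default K equal to the operator norm of X are assumptions made here (the text above reports that $K = \|X\|/\sqrt{2}$ works well in the simulations; the larger default is chosen only because it is the more conservative of the two). The groups are assumed to form a partition of the column indices.

```python
import numpy as np

def group_soft_threshold(a, thr):
    """Multivariate soft-thresholding: shrink the Euclidean norm of a by thr."""
    norm = np.linalg.norm(a)
    if norm <= thr:
        return np.zeros_like(a)
    return a * (1.0 - thr / norm)

def objective(beta, X, Y, groups, lams, K):
    """Objective (14): ||Y - X beta||_2 / K + sum_j lam_j ||beta^j||_2."""
    pen = sum(lam * np.linalg.norm(beta[G]) for G, lam in zip(groups, lams))
    return np.linalg.norm(Y - X @ beta) / K + pen

def s_tisp(X, Y, groups, lams, K=None, beta0=None, max_iter=5000, tol=1e-8):
    """Iterate (17) on the rescaled data until successive iterates agree."""
    if K is None:
        K = np.linalg.norm(X, 2)          # operator norm; conservative default (assumption)
    Xs, Ys = X / K, Y / K                 # scaling step (16)
    p = X.shape[1]
    beta = np.zeros(p) if beta0 is None else np.asarray(beta0, dtype=float).copy()
    for _ in range(max_iter):
        resid = Ys - Xs @ beta
        rnorm = np.linalg.norm(resid)
        step = beta + Xs.T @ resid
        new = np.empty(p)
        for G, lam in zip(groups, lams):
            new[G] = group_soft_threshold(step[G], lam * rnorm)
        converged = np.linalg.norm(new - beta) <= tol * max(1.0, np.linalg.norm(new))
        beta = new
        if converged:
            break
    return beta

def solution_path(X, Y, groups, weights, lam_grid, K=None):
    """Warm-started path over a decreasing grid; weights[j] multiplies each grid value."""
    beta, path = None, []
    for lam in lam_grid:
        beta = s_tisp(X, Y, groups, [lam * w for w in weights], K=K, beta0=beta)
        path.append(beta)
    return path
```

Each pass costs two matrix-vector products plus a groupwise shrinkage, and the threshold is rescaled by the current residual norm at every step, which is what distinguishes the update from ordinary group soft-thresholding for the squared loss.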

4. Simulations

4.1. Computational Time Comparison for the Square-Root Lasso

As explained in Section 3 above, the Square-Root Lasso can be computed using
one of the algorithms TFOCS, COORD, or IPM [1]. As the Square-Root Lasso is
a special case of the Group Square-Root Lasso (GSRL), corresponding to q = p,
it can also be implemented via our proposed S-TISP algorithm. In this section
we compare the three existing methods with ours in terms of computational
time. We are particularly interested in high-dimensional, sparse problems, when
p is large and $\beta^0$ is sparse. Since no competing GSRL algorithms exist, we only
consider the non-grouped version of S-TISP in the experiments below, for transparent
comparison with published literature on algorithms for the Square-Root
Lasso, which is only devoted to variable selection, and not to group selection.

For uniformity of comparison, we used a Toeplitz design as in Belloni et al. [1]
with correlation matrix $[0.5^{|i-j|}]_{p \times p}$. The noise variance is fixed at 1 and the true
signal is the p-dimensional, sparse vector $\beta^0 = (2.5, 2.5, 2.5, 2.5, 0, \dots, 0)'$.
The first four components of $\beta^0$ are fixed. The rest are all equal to zero, and their
number varies as we vary the dimension of $\beta^0$ by setting p = 25, 50, 100, 200, 500, 1000,
in order to investigate the computational scalability of each of the algorithms under
consideration. We set n = 50 for all values of p. Solution paths are computed
for $\lambda/(\sqrt{n} K) = 2^{-9}, 2^{-8.8}, \dots, 2^{-0.2}, 2^{0}$. This grid is empirically constructed to
cover all potentially interesting solutions as p varies. The error tolerance is 1e-6.
Each experiment is repeated 20 times, and we report the total CPU time.
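
A data-generating setup of this kind can be sketched as follows. The sketch assumes Gaussian rows with the stated Toeplitz correlation, four leading coefficients equal to 2.5, and grid exponents running from -9 to 0 in steps of 0.2; the function names and seeds are hypothetical and the original Matlab code may differ in details such as column normalization.

```python
import numpy as np

def toeplitz_design(n, p, rho=0.5, seed=0):
    """Draw n rows from N(0, C) with C[i, j] = rho ** |i - j| (Gaussian rows assumed)."""
    idx = np.arange(p)
    corr = rho ** np.abs(idx[:, None] - idx[None, :])
    chol = np.linalg.cholesky(corr)
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, p)) @ chol.T
    return X, corr

def sparse_signal(p, n_nonzero=4, value=2.5):
    """Coefficient vector with n_nonzero leading entries equal to value, zeros elsewhere."""
    beta0 = np.zeros(p)
    beta0[:n_nonzero] = value
    return beta0

n, p = 50, 100
X, corr = toeplitz_design(n, p)
beta0 = sparse_signal(p)
Y = X @ beta0 + np.random.default_rng(1).normal(size=n)   # noise variance 1

# Grid for lambda / (sqrt(n) K): exponents -9, -8.8, ..., -0.2, 0 (step 0.2 assumed).
grid = 2.0 ** np.linspace(-9.0, 0.0, 46)
```

The grid is listed here in increasing order; for a warm-started path it would be traversed from its largest value downwards.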

We used the Matlab codes downloaded from Belloni's website and installed
some further required Matlab packages, with necessary changes to rescale $\lambda$.
We used consistent termination criteria across methods, and suppressed the outputs.
In particular, we implemented the warm start initialization in COORD, which boosts its
convergence substantially. The original initialization in COORD relies on
a ridge regression estimate and is slow in computing a solution path. Table 1
shows the total computational time for 20 runs of each of the algorithms under
comparison.

Table 1

Computational time comparison (CPU time in seconds) of the first order method by
calling the TFOCS package, the interior point method (IPM) based on SDPT3, the
coordinatewise optimization (COORD), and the S-TISP.

           p = 25    p = 50   p = 100   p = 200   p = 500   p = 1000
S-TISP       61.7      88.2      16.6      18.5      29.4       73.0
COORD        14.2     271.3     373.4      74.3      62.4      989
IPM         121.3     142.8     171.4     222.5     367.0      640.5
TFOCS        6999     28816     19300     16392     17743      16151


As we can see from Table 1, the algorithms presented in [1] do not scale well 
for growing p, especially when p > n, except for COORD. After comparing the 
COORD estimates to those obtained by interior point methods (SDPT3 and
SeDuMi), we found that, unfortunately, COORD is a very crude and inaccurate 
approach. Its inaccuracy is exacerbated by warm starts. We also found that the 
solutions obtained by calling the TFOCS package are not trustworthy for mod- 
erate or large values of p, and that TFOCS is very slow. Our S-TISP achieves 
comparable accuracy to IPM in the above experiments, and its computational 
costs scale well with the problem size. In fact, it provides an impressive com- 
putational gain over the aforementioned algorithms for high-dimensional data, 
that is, large p. For very small values of p (say p = 25), S-TISP is not as efficient 
as COORD, but is fast enough. 

4.2. Tuning Comparison

In this part of the experiments, we provide empirical evidence of the advantages
of the Group Square-Root Lasso in parameter tuning.

We use the same Toeplitz design as before and set $\sigma = 1$. The true coefficient
vector is generated as $\beta^0 = (\{2.5\}^3, \{0\}^3, \{2.5\}^3, \{2.5\}^3, \{0\}^3, \dots, \{0\}^3)'$, consisting
of three 2.5's, three 0's, three 2.5's, three 2.5's, and finally a sequence
of blocks of three 0's. Hence, s = 3 and the group sizes are equal to 3. We fix n = 100
and vary p at 60, 300, 600. Each setup is simulated 50 times, and at each run,
the Group Square-Root Lasso algorithm, implemented through our proposed
S-TISP, is called with three parameter tuning strategies.

(a) Theoretical choice, denoted by TH. This is based on a simplified version of
the sequence $\lambda_0$ given by Lemma 2.1. To motivate our choice, we first recall the
notation $\zeta_j = \|(X^j)'X^j\|/n$ and $\zeta = \max_j \zeta_j$, where $\|A\|$ is the spectral norm of a
generic matrix A. Define $T_{\min} = \min_j T_j$, $T_{\max} = \max_j T_j$. With this notation, we
showed in the course of the proof of Lemma 2.1 that the sequence $\lambda_0$ needs to
be chosen such that, for given $\alpha$,



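$P(V \ge \lambda_0) \le \sum_{j=1}^q P\left( \frac{n^2 \zeta\, \chi^2_{T_j}}{T_j\,(\chi^2_{T_j} + \chi^2_{n - T_j})} \ge \lambda_0^2 \right) \le \alpha$,

where $\chi^2_{T_j}$ and $\chi^2_{n - T_j}$ are independent variables. Since the ratio of these
two variables, each divided by its degrees of freedom, has an F-distribution, and with the notation

$\tau := \frac{(n - T_{\max})\, \lambda_0^2/n^2}{(\zeta - T_{\min}\, \lambda_0^2/n^2)_+}$,

we further have

$P(V \ge \lambda_0) \le \sum_{j=1}^q \left( 1 - F_{T_j,\, n - T_j}(\tau) \right) \le q \left( 1 - F_{T_{\max},\, n - T_{\min}}(\tau) \right)$,

where $F_{n_1, n_2}$ denotes the cumulative distribution function of an F-distribution
with $n_1$ and $n_2$ degrees of freedom. Hence, $P(V \ge \lambda_0) \le \alpha$ if
$\tau \ge F^{-1}_{T_{\max},\, n - T_{\min}}(1 - \alpha/q) =: \tau_0$ or, equivalently, if

$\lambda_0 \ge n \sqrt{\frac{\zeta\, \tau_0}{T_{\min} \tau_0 + n - T_{\max}}}$.    (18)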

The proof of Lemma 2.1, in which control of the event $\{V \ge \lambda_0\}$ and the determination
of $\lambda_0$ is done via deviation inequalities for $\chi^2$ random variables, can
be used to show that $\lambda_0$ given by (18) above has the correct order of magnitude.
Since the calculation involving the F-distribution leading to (18) is more precise,
we advocate this choice for practical use, for models with Gaussian errors. We
further use Corollary 2.1 to choose $\lambda = \lambda_0$, for our particular design.
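
For readers who want the algebra behind (18) spelled out, here is a short worked step. It assumes that $\tau$ has the form $\tau = (n - T_{\max})\, r / (\zeta - T_{\min}\, r)_+$ with $r = \lambda_0^2/n^2$, which is inferred from the surrounding text rather than quoted, and it treats the case $\zeta - T_{\min}\, r > 0$ (when $\zeta - T_{\min}\, r \le 0$ the bound in (18) holds automatically, since then $r \ge \zeta/T_{\min}$).

```latex
\begin{align*}
\tau \ge \tau_0
&\iff (n - T_{\max})\, r \ge \tau_0\,(\zeta - T_{\min}\, r) \\
&\iff r\,(T_{\min}\tau_0 + n - T_{\max}) \ge \zeta\,\tau_0 \\
&\iff \lambda_0 \ge n \sqrt{\frac{\zeta\,\tau_0}{T_{\min}\tau_0 + n - T_{\max}}},
\qquad r = \lambda_0^2/n^2 .
\end{align*}
```

The only inputs needed in practice are therefore $\zeta$, the extreme group sizes, the sample size and one quantile $\tau_0$ of an F-distribution.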

Therefore, we use the form (14) in our implementation, with

$\lambda_j = \sqrt{\zeta \tau_0 / (T_{\min} \tau_0 + n - T_{\max})}\; \sqrt{n T_j} / K$,

and $\tau_0 = F^{-1}_{T_{\max},\, n - T_{\min}}(1 - \alpha/q)$, $K = \|X\|_2/\sqrt{2}$ and $\alpha = 0.01$. After the optimal
estimate is located, bias correction is conducted by fitting a local OLS restricted
to the selected dimensions, to boost the prediction accuracy.

(b) Cross-Validation (CV). We use 5-fold CV to determine the optimal value of
$\lambda$ and the associated estimate. Similarly, bias-correction is performed at the end.

(c) SCV-BIC [19]. We cross-validate the sparsity patterns instead of the values
of $\lambda$. Unlike K-fold CV, only one penalized solution path needs to be generated
by running the Group Square-Root Lasso on the entire dataset. This determines
the candidate sparsity patterns. Then, we fit restricted OLS in each CV training
set to evaluate the validation error of the associated sparsity pattern and append a
BIC correction term to the total validation error. SCV-BIC is much less expensive
than CV, noting that the OLS fitting is cheap, and has been shown to bring
significant performance improvement, see [19] for details and [6] for a similar
approach.

To measure the prediction accuracy, we generated additional test data with
$N_{test} = 10^4$ observations in each simulation. The effective prediction error is
given by $\mathrm{MSE} = 100 \cdot \left( \sum_{i=1}^{N_{test}} (y_i - x_i'\hat\beta)^2 / (N_{test}\sigma^2) - 1 \right)$. We found that the
histogram of MSE is highly asymmetric and far from Gaussian. Therefore, the
40% trimmed mean (instead of the mean or the somewhat crude median) of the
MSEs was reported as the goodness of fit of the obtained model. We characterize
the selection consistency by computing the Miss (M) rate, the mean of
$|\{j : \beta^{0,j} \neq 0, \hat\beta^j = 0\}| / |\{j : \beta^{0,j} \neq 0\}|$ over all simulations, where $|\cdot|$ is the
cardinality of a set, and the False Alarm (FA) rate, the mean of
$|\{j : \beta^{0,j} = 0, \hat\beta^j \neq 0\}| / |\{j : \beta^{0,j} = 0\}|$ over all simulations. Correct selection occurs when
M = FA = 0.
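
The three summary measures can be sketched as follows. This is an illustration only: the function names are hypothetical, and the trimming convention (removing half of the trimmed fraction from each tail) is an assumption, since the text does not spell it out.

```python
def miss_and_false_alarm(true_nonzero, est_nonzero, q):
    """Miss and false-alarm rates for a single simulation run.

    true_nonzero, est_nonzero -- sets of group indices whose true /
    estimated coefficient block is nonzero; q -- total number of groups.
    """
    true_zero = set(range(q)) - true_nonzero
    missed = true_nonzero - est_nonzero          # truly nonzero, estimated zero
    false_alarms = est_nonzero & true_zero       # truly zero, estimated nonzero
    m = len(missed) / len(true_nonzero) if true_nonzero else 0.0
    fa = len(false_alarms) / len(true_zero) if true_zero else 0.0
    return m, fa


def effective_mse(y, y_hat, sigma2):
    """100 * (test residual sum of squares / (N_test * sigma^2) - 1)."""
    rss = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))
    return 100.0 * (rss / (len(y) * sigma2) - 1.0)


def trimmed_mean(values, trim_fraction):
    """Mean after dropping trim_fraction / 2 of the values from each tail
    (assumed convention)."""
    xs = sorted(values)
    k = int(len(xs) * trim_fraction / 2)
    kept = xs[k:len(xs) - k] if k > 0 else xs
    return sum(kept) / len(kept)
```

Averaging the per-run miss and false-alarm rates over all simulation runs gives the M and FA figures; the trimmed mean is applied to the per-run MSE values.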

We conclude from Table 2 that the selection by CV is acceptable, especially
in high-dimensional, sparse problems, but it has the worst behavior relative
to the other tuning strategies. SCV-BIC gives excellent prediction accuracy
and recovers the true sparsity pattern successfully. It is much more efficient
than CV but still requires the computation of one Group Square-Root Lasso
solution path. The theoretical choice (TH) directly specifies the value for the
regularization parameter and there is no need for a time-consuming grid search.
For Gaussian errors, this particular TH gives almost comparable performance
to SCV-BIC in terms of both prediction and variable selection accuracy.

Table 2
Performance of Group Square-Root Lasso Tunings: CV, SCV-BIC, and the
theoretical choice (TH), in terms of miss rate (M), false alarm rate (FA), and
prediction error (MSE).

                   p = 60                  p = 300                 p = 600
            M      FA      MSE      M      FA      MSE      M      FA      MSE
CV          0%     12.75%  23.21    0%     2.56%   22.16    0%     0.80%   18.36
SCV-BIC     0%     0%      9.82     0%     0.02%   9.99     0%     0.01%   8.95
TH          0%     0%      9.82     0%     0%      9.99     0.67%  0%      9.20

Appendix 

Proofs for Section 2 

Throughout this section we will make use of the following basic fact.

Lemma 4.1. For a given $\alpha \in (0, 1)$, let $t = \sqrt{4\ln(1/\alpha)/n} + 4\ln(1/\alpha)/n$ and define

\[
\mathcal{B} := \left\{ \|\varepsilon\|_2/\sqrt{n} \le \sqrt{1+t} \right\}. \qquad (19)
\]

Then,

\[
P(\mathcal{B}) \ge 1 - \alpha.
\]

The proof of this result is a direct application of Lemma 8.1 in [4]. Notice
that, on $\mathcal{B}$, we have $\|\varepsilon\|_2/\sqrt{n} \le C$, for a dominating constant C. We will make
implicit use of this fact throughout.
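
The coverage statement of Lemma 4.1 is easy to check by simulation. The sketch below assumes standard Gaussian errors and the threshold $\sqrt{1+t}$ with $t = \sqrt{4\ln(1/\alpha)/n} + 4\ln(1/\alpha)/n$; the function names are hypothetical.

```python
import math
import random


def lemma41_threshold(n, alpha):
    """Threshold sqrt(1 + t) for ||eps||_2 / sqrt(n) in Lemma 4.1."""
    t = math.sqrt(4 * math.log(1 / alpha) / n) + 4 * math.log(1 / alpha) / n
    return math.sqrt(1 + t)


def empirical_coverage(n, alpha, reps, seed=0):
    """Fraction of simulated N(0, I_n) vectors that fall in the event B."""
    rng = random.Random(seed)
    bound = lemma41_threshold(n, alpha)
    hits = 0
    for _ in range(reps):
        norm_sq = sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(n))
        if math.sqrt(norm_sq / n) <= bound:
            hits += 1
    return hits / reps
```

Since the bound is conservative, the simulated coverage is typically well above $1 - \alpha$.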

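Proof of Theorem 2.1. In the first step of the proof, we show that $\delta := \hat\beta - \beta^0 \in \Delta_{\bar\gamma}$.
The desired bounds are then derived in a second step.

For the first step, we note that the definition of the estimator (4) implies

\[
\frac{\|Y - X\hat\beta\|_2}{\sqrt{n}} - \frac{\|Y - X\beta^0\|_2}{\sqrt{n}} \le \frac{\lambda}{n} \sum_{j=1}^{q} \sqrt{T_j}\left(\|\beta^{0,j}\|_2 - \|\hat\beta^j\|_2\right),
\]

and simple algebra yields

\[
\sum_{j=1}^{q} \sqrt{T_j}\left(\|\beta^{0,j}\|_2 - \|\hat\beta^j\|_2\right) \le \sum_{j \in S} \sqrt{T_j}\,\|\delta^j\|_2 - \sum_{j \in S^c} \sqrt{T_j}\,\|\delta^j\|_2.
\]

These two inequalities give

\[
\frac{\|Y - X\hat\beta\|_2}{\sqrt{n}} - \frac{\|Y - X\beta^0\|_2}{\sqrt{n}} \le \frac{\lambda}{n}\left( \sum_{j \in S} \sqrt{T_j}\,\|\delta^j\|_2 - \sum_{j \in S^c} \sqrt{T_j}\,\|\delta^j\|_2 \right). \qquad (20)
\]

Next, we bound the error term. We obtain, via an application of the
Cauchy-Schwarz inequality, and recalling the definition of the error term V:

\[
|\varepsilon' X \delta| = \Big| \sum_{j=1}^{q} \varepsilon' X^j \delta^j \Big|
\le \sum_{j=1}^{q} \|(X'\varepsilon)^j\|_2 \,\|\delta^j\|_2
\le \max_{1 \le j \le q} \frac{\sqrt{n}\,\|(X'\varepsilon)^j\|_2}{\sqrt{T_j}\,\|\varepsilon\|_2} \cdot \frac{\|\varepsilon\|_2}{\sqrt{n}} \sum_{j=1}^{q} \sqrt{T_j}\,\|\delta^j\|_2
= V \frac{\|\varepsilon\|_2}{\sqrt{n}} \sum_{j=1}^{q} \sqrt{T_j}\,\|\delta^j\|_2. \qquad (21)
\]

We then observe that

\[
\nabla_\beta \|Y - X\beta\|_2 \Big|_{\beta = \beta^0} = -\frac{X'\varepsilon}{\|\varepsilon\|_2},
\]

and use Inequality (21) and the fact that any norm is convex to obtain

\[
\frac{\|Y - X\hat\beta\|_2}{\sqrt{n}} - \frac{\|Y - X\beta^0\|_2}{\sqrt{n}} \ge -\frac{|\varepsilon' X \delta|}{\sqrt{n}\,\|\varepsilon\|_2} \ge -\frac{V}{n} \sum_{j=1}^{q} \sqrt{T_j}\,\|\delta^j\|_2.
\]

Since on the set $\mathcal{A}$ we have $\lambda/\gamma \ge V$, we further obtain

\[
\frac{\|Y - X\hat\beta\|_2}{\sqrt{n}} - \frac{\|Y - X\beta^0\|_2}{\sqrt{n}} \ge -\frac{\lambda}{\gamma n} \sum_{j=1}^{q} \sqrt{T_j}\,\|\delta^j\|_2. \qquad (22)
\]

Combining (20) and (22), we find

\[
-\frac{\lambda}{\gamma n} \sum_{j=1}^{q} \sqrt{T_j}\,\|\delta^j\|_2 \le \frac{\lambda}{n}\left( \sum_{j \in S} \sqrt{T_j}\,\|\delta^j\|_2 - \sum_{j \in S^c} \sqrt{T_j}\,\|\delta^j\|_2 \right),
\]

and thus

\[
\left(1 - \frac{1}{\gamma}\right) \frac{\lambda}{n} \sum_{j \in S^c} \sqrt{T_j}\,\|\delta^j\|_2 \le \left(1 + \frac{1}{\gamma}\right) \frac{\lambda}{n} \sum_{j \in S} \sqrt{T_j}\,\|\delta^j\|_2.
\]

With $\bar\gamma = (\gamma+1)/(\gamma-1)$, this implies $\frac{\lambda}{n} \sum_{j \in S^c} \sqrt{T_j}\,\|\delta^j\|_2 \le \bar\gamma\, \frac{\lambda}{n} \sum_{j \in S} \sqrt{T_j}\,\|\delta^j\|_2$ and since $\lambda > 0$,
we obtain

\[
\sum_{j \in S^c} \sqrt{T_j}\,\|\delta^j\|_2 \le \bar\gamma \sum_{j \in S} \sqrt{T_j}\,\|\delta^j\|_2, \qquad (23)
\]

or equivalently, $\delta \in \Delta_{\bar\gamma}$, as desired.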

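To derive the bounds stated in the theorem we begin by observing that

\[
\left( \frac{\|Y - X\hat\beta\|_2}{\sqrt{n}} - \frac{\|Y - X\beta^0\|_2}{\sqrt{n}} \right)^2 \le \frac{(1 \vee (\bar\gamma - 1)^2)\lambda^2}{n^2} \Big( \sum_{j \in S} \sqrt{T_j}\,\|\delta^j\|_2 \Big)^2 \qquad (24)
\]

by (20), (22), and (23). Next, we write

\[
\frac{\|Y - X\hat\beta\|_2^2}{n} - \frac{\|Y - X\beta^0\|_2^2}{n} = \frac{\|X\delta - \sigma\varepsilon\|_2^2}{n} - \frac{\|\sigma\varepsilon\|_2^2}{n} = \frac{\|X\delta\|_2^2}{n} - \frac{2\sigma\varepsilon' X\delta}{n},
\]

and we use (20), (21), (23), and (24), to obtain

\[
\begin{aligned}
\frac{\|X\delta\|_2^2}{n}
&= \frac{\|Y - X\hat\beta\|_2^2}{n} - \frac{\|Y - X\beta^0\|_2^2}{n} + \frac{2\sigma\varepsilon' X\delta}{n} \\
&\le \frac{2\|Y - X\beta^0\|_2}{\sqrt{n}} \left( \frac{\|Y - X\hat\beta\|_2}{\sqrt{n}} - \frac{\|Y - X\beta^0\|_2}{\sqrt{n}} \right)
 + \left( \frac{\|Y - X\hat\beta\|_2}{\sqrt{n}} - \frac{\|Y - X\beta^0\|_2}{\sqrt{n}} \right)^2
 + \frac{2V\|\sigma\varepsilon\|_2}{n^{3/2}} \sum_{j=1}^{q} \sqrt{T_j}\,\|\delta^j\|_2 \\
&\le \frac{2\lambda\|\sigma\varepsilon\|_2}{n^{3/2}} \Big( \sum_{j \in S} \sqrt{T_j}\,\|\delta^j\|_2 - \sum_{j \in S^c} \sqrt{T_j}\,\|\delta^j\|_2 \Big)
 + \frac{(1 \vee (1 - \bar\gamma)^2)\lambda^2}{n^2} \Big( \sum_{j \in S} \sqrt{T_j}\,\|\delta^j\|_2 \Big)^2
 + \frac{2V\|\sigma\varepsilon\|_2}{n^{3/2}} \sum_{j=1}^{q} \sqrt{T_j}\,\|\delta^j\|_2 \\
&\le \left(1 + \frac{1}{\gamma}\right) \frac{2\lambda\|\sigma\varepsilon\|_2}{n^{3/2}} \sum_{j \in S} \sqrt{T_j}\,\|\delta^j\|_2
 + \frac{(1 \vee (1 - \bar\gamma)^2)\lambda^2}{n^2} \Big( \sum_{j \in S} \sqrt{T_j}\,\|\delta^j\|_2 \Big)^2,
\end{aligned}
\]

since on $\mathcal{A}$ we have $\lambda/\gamma \ge V$. Therefore, by the Compatibility Condition (5),

\[
\frac{\|X\delta\|_2^2}{n} \le \left(1 + \frac{1}{\gamma}\right) \frac{2\lambda\|\sigma\varepsilon\|_2 \sqrt{s^*}\,\|X\delta\|_2}{n^2\kappa} + \frac{(1 \vee (1 - \bar\gamma)^2)\lambda^2 s^*\,\|X\delta\|_2^2}{n^3\kappa^2},
\]

and hence

\[
\left( 1 - \frac{(1 \vee (1 - \bar\gamma)^2)\lambda^2 s^*}{n^2\kappa^2} \right) \frac{\|X\delta\|_2^2}{n} \le \left(1 + \frac{1}{\gamma}\right) \frac{2\lambda\sqrt{s^*}\,\|\sigma\varepsilon\|_2\,\|X\delta\|_2}{n^2\kappa}.
\]

As a consequence,

\[
\frac{\|X\delta\|_2^2}{n} \le \frac{u\sqrt{s^*}\,\lambda\,\|\sigma\varepsilon\|_2\,\|X\delta\|_2}{n^2\kappa},
\]

where

\[
u := \frac{2(\gamma + 1)}{\gamma \left( 1 - \frac{(1 \vee (1 - \bar\gamma)^2)\lambda^2 s^*}{n^2\kappa^2} \right)} \in (0, \infty),
\]

by assumption (9). Since on $\mathcal{B}$ we have

\[
\frac{\|\sigma\varepsilon\|_2}{\sqrt{n}} \le \sigma\sqrt{1+t},
\]

the first statement of the theorem follows:

\[
\|X(\hat\beta - \beta^0)\|_2 \le \sigma\sqrt{1+t}\, \frac{u\lambda\sqrt{s^*}}{\kappa\sqrt{n}} \lesssim \frac{\sigma\lambda\sqrt{s^*}}{\kappa\sqrt{n}}.
\]

For the second claim, we use the fact that $\delta \in \Delta_{\bar\gamma}$ and the Compatibility
Condition (5) to deduce that

\[
\sum_{j=1}^{q} \sqrt{T_j}\,\|\delta^j\|_2 \le (1 + \bar\gamma) \sum_{j \in S} \sqrt{T_j}\,\|\delta^j\|_2 \le \frac{(\bar\gamma + 1)\sqrt{s^*}\,\|X(\hat\beta - \beta^0)\|_2}{\sqrt{n}\,\kappa} \le \frac{\lambda(\bar\gamma + 1) u s^*\,\|\sigma\varepsilon\|_2}{n^{3/2}\kappa^2}.
\]

Therefore, again on the set $\mathcal{B}$, we have

\[
\sum_{j=1}^{q} \sqrt{T_j}\,\|(\hat\beta - \beta^0)^j\|_2 \le \sigma\sqrt{1+t}\, \frac{(\bar\gamma + 1) u \lambda s^*}{n\kappa^2} \lesssim \frac{\sigma\lambda s^*}{n\kappa^2},
\]

which concludes the proof of this theorem. □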

We next prove Theorem 2.2. We begin with two preparatory results. 

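Lemma 4.2. Assume that $Y \neq X\hat\beta$ over some set $\mathcal{C}$. Then, on $\mathcal{C}$, the quantity
$\hat\beta$ is a solution of the criterion (4) if and only if for every $1 \le j \le q$

\[
\hat\beta^j \neq 0 \;\Rightarrow\; \frac{(X^j)'(Y - X\hat\beta)}{\|Y - X\hat\beta\|_2} = \frac{\lambda\sqrt{T_j}\,\hat\beta^j}{\sqrt{n}\,\|\hat\beta^j\|_2}, \qquad (25)
\]

\[
\hat\beta^j = 0 \;\Rightarrow\; \frac{\|(X^j)'(Y - X\hat\beta)\|_2}{\|Y - X\hat\beta\|_2} \le \frac{\lambda\sqrt{T_j}}{\sqrt{n}}. \qquad (26)
\]

Proof. Since all terms of the criterion (4) are convex, and thus the criterion is
convex, we can apply standard subgradient calculus. The subgradient $\partial_x f$ of a
convex function $f : \mathbb{R}^p \to \mathbb{R}$ at a point $x \in \mathbb{R}^p$ is defined as the set of vectors
$v \in \mathbb{R}^p$ such that for all $y \in \mathbb{R}^p$

\[
f(y) \ge f(x) + v'(y - x).
\]

From this, one derives easily that subgradients are linear and additive and that
the subgradient $\partial_x f$ is equal to the gradient $\nabla_x f$ if the function f is differentiable
at x. Moreover, $x \in \mathbb{R}^p$ is a minimum of the function f if and only if $0 \in \partial_x f$.
Since $Y \neq X\hat\beta$, the first term of the criterion (4) is differentiable and we have

\[
\nabla_\beta \frac{\|Y - X\beta\|_2}{\sqrt{n}} \Big|_{\beta = \hat\beta} = -\frac{X'(Y - X\hat\beta)}{\sqrt{n}\,\|Y - X\hat\beta\|_2}. \qquad (27)
\]

For the remaining terms, we observe that for any vector $u \in \mathbb{R}^T \setminus \{0\}$, $T \in \mathbb{N}$,

\[
\nabla_u \|u\|_2 = \frac{u}{\|u\|_2} \qquad (28)
\]

and for u = 0

\[
w \in \partial_{u=0}\|u\|_2 \;\Leftrightarrow\; \|z\|_2 \ge \|0\|_2 + (z - 0)'w = z'w \quad \text{for all } z \in \mathbb{R}^T \qquad (29)
\]

and, consequently, $\partial_{u=0}\|u\|_2 = \{w \in \mathbb{R}^T : \|w\|_2 \le 1\}$.

The claim follows then from Equations (27), (28), and (29). □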
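
The optimality conditions lend themselves to a direct numerical check of a candidate solution. The sketch below is an illustration under stated assumptions: it takes criterion (4) to be $\|Y - X\beta\|_2/\sqrt{n} + (\lambda/n)\sum_j \sqrt{T_j}\,\|\beta^j\|_2$, reads the stationarity conditions as equality of the normalized score with $\lambda\sqrt{T_j}/\sqrt{n}$ times the unit vector of an active group, and as a norm bound for an inactive group. The function name `kkt_violation` is hypothetical.

```python
import math


def kkt_violation(X, Y, beta, groups, lam):
    """Largest violation of the group-wise KKT conditions at beta.

    X -- list of n rows, Y -- list of n responses, groups -- list of lists
    of column indices, lam -- tuning parameter. Returns 0 when the
    conditions hold exactly.
    """
    n = len(Y)
    resid = [Y[i] - sum(X[i][k] * beta[k] for k in range(len(beta)))
             for i in range(n)]
    r_norm = math.sqrt(sum(r * r for r in resid))
    if r_norm == 0.0:
        raise ValueError("the conditions assume a nonzero residual")
    worst = 0.0
    for g in groups:
        level = lam * math.sqrt(len(g)) / math.sqrt(n)
        # Normalized score (X^j)'(Y - X beta) / ||Y - X beta||_2.
        score = [sum(X[i][k] * resid[i] for i in range(n)) / r_norm for k in g]
        b_norm = math.sqrt(sum(beta[k] ** 2 for k in g))
        if b_norm > 0.0:
            # Active group: score must equal level * beta^j / ||beta^j||_2.
            gap = math.sqrt(sum((s - level * beta[k] / b_norm) ** 2
                                for s, k in zip(score, g)))
        else:
            # Inactive group: score norm must not exceed level.
            gap = max(0.0, math.sqrt(sum(s * s for s in score)) - level)
        worst = max(worst, gap)
    return worst
```

For a one-column design with X = [[1], [0]], Y = (5, 4) and $\lambda = 0.6\sqrt{2}$, the point $\beta = 2$ satisfies the conditions, while $\beta = 0$ violates the bound for an inactive group.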

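Lemma 4.3. Under the conditions of Theorem 2.1, it holds that, on the set
$\mathcal{A} \cap \mathcal{B}$, we have

\[
\left( 1 - \frac{u\lambda\sqrt{s^*}}{n\kappa} \right) \|\sigma\varepsilon\|_2 \le \|Y - X\hat\beta\|_2 \le \left( 1 + \frac{u\lambda\sqrt{s^*}}{n\kappa} \right) \|\sigma\varepsilon\|_2.
\]

Proof. By the triangle inequality

\[
\|\sigma\varepsilon\|_2 - \|X(\hat\beta - \beta^0)\|_2 \le \|Y - X\hat\beta\|_2 \le \|\sigma\varepsilon\|_2 + \|X(\hat\beta - \beta^0)\|_2.
\]

The claim follows immediately by Theorem 2.1 above. □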

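Proof of Theorem 2.2. The crucial step in this proof is to use the KKT Conditions
in Lemma 4.2 in order to show that, on $\mathcal{A}_1 \cap \mathcal{B}$, we have $\hat\beta_{S^c} = 0$.

First, we observe that Lemma 4.3 implies that $Y - X\hat\beta \neq 0$ on $\mathcal{A} \cap \mathcal{B}$,
and we can consequently apply the KKT Conditions derived in Lemma 4.2 for
$\mathcal{C} = \mathcal{A} \cap \mathcal{B}$. Moreover, since by definition, $\mathcal{A}_1 \cap \mathcal{B} \subset \mathcal{A} \cap \mathcal{B}$, the results also hold
on the smaller set. Thus, there exists a vector $\tau \in \mathbb{R}^p$ such that $\|\tau^j\|_2 \le \sqrt{T_j}$
for all $1 \le j \le q$ and, additionally, $\tau^j = \sqrt{T_j}\,\hat\beta^j/\|\hat\beta^j\|_2$ for all $1 \le j \le q$ such that
$\hat\beta^j \neq 0$, and $\tau$ satisfies the equality

\[
\frac{X'(Y - X\hat\beta)}{\|Y - X\hat\beta\|_2} = \frac{\lambda}{\sqrt{n}}\,\tau.
\]

We rewrite this with

\[
\nu := \|Y - X\hat\beta\|_2 \quad \text{and} \quad \delta := \hat\beta - \beta^0
\]

as

\[
\sigma X'\varepsilon - X'X\delta = \frac{\nu\lambda}{\sqrt{n}}\,\tau. \qquad (30)
\]

So, on the one hand, we have

\[
-n^2\Sigma_{1,1}\delta_S - n^2\Sigma_{1,2}\delta_{S^c} = \sqrt{n}\,\nu\lambda\,\tau_S - n\sigma (X'\varepsilon)_S,
\]

or, equivalently,

\[
-n^2\delta_S - n^2\Sigma_{1,1}^{-1}\Sigma_{1,2}\delta_{S^c} = \sqrt{n}\,\nu\lambda\,\Sigma_{1,1}^{-1}\tau_S - n\sigma\,\Sigma_{1,1}^{-1}(X'\varepsilon)_S,
\]

and finally

\[
-n^2\delta_{S^c}'\Sigma_{2,1}\delta_S - n^2\delta_{S^c}'\Sigma_{2,1}\Sigma_{1,1}^{-1}\Sigma_{1,2}\delta_{S^c} = \sqrt{n}\,\nu\lambda\,\delta_{S^c}'\Sigma_{2,1}\Sigma_{1,1}^{-1}\tau_S - n\sigma\,\delta_{S^c}'\Sigma_{2,1}\Sigma_{1,1}^{-1}(X'\varepsilon)_S. \qquad (31)
\]

On the other hand, we have

\[
-n^2\Sigma_{2,1}\delta_S - n^2\Sigma_{2,2}\delta_{S^c} = \sqrt{n}\,\nu\lambda\,\tau_{S^c} - n\sigma (X'\varepsilon)_{S^c}.
\]

Since for $j \in S^c$

\[
(\delta^j)'\tau^j = (\hat\beta^j)'\tau^j = \sqrt{T_j}\,\|\hat\beta^j\|_2 = \sqrt{T_j}\,\|\delta^j\|_2,
\]

this implies that

\[
-n^2\delta_{S^c}'\Sigma_{2,1}\delta_S - n^2\delta_{S^c}'\Sigma_{2,2}\delta_{S^c}
= \sqrt{n}\,\nu\lambda\,\delta_{S^c}'\tau_{S^c} - n\sigma\,\delta_{S^c}'(X'\varepsilon)_{S^c}
= \sqrt{n}\,\nu\lambda \sum_{j \in S^c} \sqrt{T_j} \left( \|\delta^j\|_2 - \frac{\sqrt{n}\,\sigma\,(\delta^j)'(X'\varepsilon)^j}{\sqrt{T_j}\,\lambda\nu} \right).
\]

The right-hand side can be bounded from below, using the Cauchy-Schwarz
inequality, by

\[
\sqrt{n}\,\nu\lambda \sum_{j \in S^c} \sqrt{T_j} \left( \|\delta^j\|_2 - \|\delta^j\|_2\, \frac{\sqrt{n}\,\sigma\,\|(X'\varepsilon)^j\|_2}{\sqrt{T_j}\,\lambda\nu} \right).
\]

Lemma 4.3 implies that $\lambda/\eta \ge \tilde V$ for

\[
\tilde V := \max_{1 \le j \le q} \frac{\sigma\sqrt{n}\,\|(X'\varepsilon)^j\|_2}{\sqrt{T_j}\,\nu},
\]

and thus, the above term can be bounded from below by

\[
\left( 1 - \frac{1}{\eta} \right) \sqrt{n}\,\nu\lambda \sum_{j \in S^c} \sqrt{T_j}\,\|\delta^j\|_2.
\]

So, in summary, we have

\[
-n^2\delta_{S^c}'\Sigma_{2,1}\delta_S - n^2\delta_{S^c}'\Sigma_{2,2}\delta_{S^c} \ge \left( 1 - \frac{1}{\eta} \right) \sqrt{n}\,\nu\lambda \sum_{j \in S^c} \sqrt{T_j}\,\|\delta^j\|_2. \qquad (32)
\]

Subtracting Equation (32) from Equation (31) then yields

\[
n^2\delta_{S^c}'\left(\Sigma_{2,2} - \Sigma_{2,1}\Sigma_{1,1}^{-1}\Sigma_{1,2}\right)\delta_{S^c}
\le \sqrt{n}\,\nu\lambda\,\delta_{S^c}'\Sigma_{2,1}\Sigma_{1,1}^{-1}\tau_S - n\sigma\,\delta_{S^c}'\Sigma_{2,1}\Sigma_{1,1}^{-1}(X'\varepsilon)_S - \left( 1 - \frac{1}{\eta} \right) \sqrt{n}\,\nu\lambda \sum_{j \in S^c} \sqrt{T_j}\,\|\delta^j\|_2. \qquad (33)
\]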

The first term of the right-hand side above can be bounded via the Cauchy- 
Schwarz's inequality by 



Now, we observe that if Xj^ > V, then ^\\{X'ey\\2 < ^ for all < j < cj, 
and thus, the above expression can be bounded by 

VnuX max > V^jW h 7^= 

l + -]y/nuX max > ^Tjl (5^1 2 j= • 



If 7^ 0, this is strictly smaller than 

by our Group Irrepresentable Condition. Then, by inequality (33), this yields

\[
n^2\delta_{S^c}'\left(\Sigma_{2,2} - \Sigma_{2,1}\Sigma_{1,1}^{-1}\Sigma_{1,2}\right)\delta_{S^c} < 0.
\]

But since $\Sigma_{2,2} - \Sigma_{2,1}\Sigma_{1,1}^{-1}\Sigma_{1,2} \ge 0$, this leads to a contradiction. Hence, $\delta_{S^c} = 0$
and the first claim is proved.

For the second claim, we invoke $\delta_{S^c} = 0$ to obtain, using Equation (30) and recalling that $\nu = \|Y - X\hat\beta\|_2$,

\[
-n^2\Sigma_{1,1}\delta_S = \sqrt{n}\,\nu\lambda\,\tau_S - n\sigma (X'\varepsilon)_S.
\]

This implies

-n ds = VnuXT,-!^^^ \ts ^ I 

and, using X/r] < V and bounding the norms as above, 

\\d^\o,< max A V 



< 



< 



1 + 77 " " n 

X FT'- 

< B ^ \ for all 1 <j < q, 
n 

which is the second claim of this theorem. In the above derivation we used
Lemma 4.3 above for the third inequality, and assumption (9) for the fourth, together with
the fact that $\sqrt{1+t}$ is bounded by a constant, by the definition of the set $\mathcal{B}$
in Lemma 4.1 above. We also recall that under (9), the quantity u is a positive
constant.

The third claim follows immediately from the first two and the Beta Min
Condition. This concludes the proof of this theorem. □

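Proof of Lemma 2.1. We first observe that

\[
P(V > \lambda_0) = P\left( \max_{1 \le j \le q} \frac{\sqrt{n}\,\|(X^j)'\varepsilon\|_2}{\sqrt{T_j}\,\|\varepsilon\|_2} > \lambda_0 \right)
\le \sum_{j=1}^{q} P\left( \frac{\varepsilon'\left( X^j (X^j)'/n \right)\varepsilon}{\|\varepsilon\|_2^2} > \frac{\lambda_0^2 T_j}{n^2} \right).
\]

Let $U'(j)D(j)U(j)$ be a spectral decomposition of $X^j(X^j)'/n$ such that $U(j)$
is orthogonal and $D(j)$ is diagonal with diagonal entries $\zeta_1(j) \ge \dots \ge \zeta_{T_j}(j) \ge
\zeta_{T_j+1}(j) = \dots = \zeta_n(j) = 0$. With the notation $C_j = \|X^j\|^2/n$, where $\|A\|$ is the
spectral norm of a generic matrix A, we have $C_j = \zeta_1(j)$. It follows that

\[
\frac{\varepsilon' U'(j)D(j)U(j)\varepsilon}{\|\varepsilon\|_2^2}
= \frac{(U(j)\varepsilon)' D(j)\, (U(j)\varepsilon)}{\|U(j)\varepsilon\|_2^2}
\le \frac{C_j\,\|\varepsilon_1\|_2^2}{\|\varepsilon_1\|_2^2 + \|\varepsilon_2\|_2^2},
\]

where $\varepsilon_1$ and $\varepsilon_2$, the first $T_j$ and the last $n - T_j$ coordinates of $U(j)\varepsilon$, are independent with $\varepsilon_1 \sim N(0, I_{T_j})$ and $\varepsilon_2 \sim N(0, I_{n-T_j})$.
Thus, for any fixed $r \in (0,1)$ we have

\[
\begin{aligned}
P(V > \lambda_0)
&\le \sum_{j=1}^{q} P\left( \frac{\|\varepsilon_1\|_2^2}{\|\varepsilon_2\|_2^2} \ge \frac{T_j}{\left(n^2 C_j/\lambda_0^2 - T_j\right)_+} \right) \\
&\le \sum_{j=1}^{q} P\left( \frac{\|\varepsilon_1\|_2^2}{T_j} \cdot \frac{\left(n^2 C_j/\lambda_0^2 - T_j\right)_+}{n - T_j} \ge 1 - r \right)
 + \sum_{j=1}^{q} P\left( \frac{\|\varepsilon_2\|_2^2}{n - T_j} \le 1 - r \right). \qquad (34)
\end{aligned}
\]

If $C_j \le \lambda_0^2 T_j/n^2$ then the corresponding term in the first sum in the inequality above is trivially equal
to zero, therefore the argument below is needed only when the reverse
inequality holds. From Laurent and Massart [10, Lemma 1], $P(X - d \ge dt) \le
\exp\left(-\frac{d}{4}\left(\sqrt{1 + 2t} - 1\right)^2\right)$ and $P(X \le d - dt) \le \exp\left(-\frac{d}{4}t^2\right)$, for $X \sim \chi^2(d)$.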
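
These two tail bounds are a change of variables away from the usual statement of the Laurent-Massart inequalities. As a brief sketch, assuming the standard form $P(X - d \ge 2\sqrt{dx} + 2x) \le e^{-x}$ and $P(X \le d - 2\sqrt{dx}) \le e^{-x}$ for $X \sim \chi^2(d)$ and $x > 0$:

\[
2\sqrt{dx} + 2x = dt \;\Longleftrightarrow\; \sqrt{x} = \frac{\sqrt{d}}{2}\left(\sqrt{1 + 2t} - 1\right) \;\Longleftrightarrow\; x = \frac{d}{4}\left(\sqrt{1 + 2t} - 1\right)^2,
\qquad
2\sqrt{dx} = dt \;\Longleftrightarrow\; x = \frac{d}{4}\,t^2.
\]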

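Therefore, for the first term in (34) we obtain, for each j:

\[
\begin{aligned}
P\left( \frac{\|\varepsilon_1\|_2^2}{T_j} \cdot \frac{\left(n^2 C_j/\lambda_0^2 - T_j\right)_+}{n - T_j} \ge 1 - r \right)
&\le \exp\left( -\frac{T_j}{4} \left( \sqrt{\frac{2(1-r)(n - T_j)}{n^2 C_j/\lambda_0^2 - T_j} - 1} - 1 \right)^2 \right) \\
&\le \exp\left( -\frac{T_{\min}}{4} \left( \sqrt{\frac{2(1-r)(n - T_{\max})}{n^2 C/\lambda_0^2 - T_{\min}} - 1} - 1 \right)^2 \right).
\end{aligned}
\]

To bound the last term in (34) we first obtain, for each j:

\[
P\left( \frac{\|\varepsilon_2\|_2^2}{n - T_j} \le 1 - r \right) \le \exp\left( -\frac{(n - T_j)\,r^2}{4} \right) \le \exp\left( -\frac{(n - T_{\max})\,r^2}{4} \right).
\]

Hence,

\[
P(V > \lambda_0) \le q \cdot \exp\left( -\frac{T_{\min}}{4} \left( \sqrt{\frac{2(1-r)(n - T_{\max})}{n^2 C/\lambda_0^2 - T_{\min}} - 1} - 1 \right)^2 \right) + q \cdot \exp\left( -\frac{(n - T_{\max})\,r^2}{4} \right).
\]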

For r = 2 
with 



log(2g/o 



the last term is bounded by a/2. For this value of r and 



Ao 



2Cn 



1 



'21og(2g/a) 



the first term is also bounded by a/2. This concludes the proof. 



□ 



Proofs for Section 3 

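Lemma 4.4. Given any $\lambda$, $\Theta(\cdot; \lambda)$ is nonexpansive, that is, $\|\Theta(x; \lambda) - \Theta(\tilde x; \lambda)\|_2 \le
\|x - \tilde x\|_2$ for all $x, \tilde x \in \mathbb{R}^T$.

Proof. Define $\Delta := \|x - \tilde x\|_2^2 - \|\Theta(x; \lambda) - \Theta(\tilde x; \lambda)\|_2^2$, $a := \|x\|_2$, $b := \|\tilde x\|_2$, and
$c := x'\tilde x/(ab)$. It holds that $|c| \le 1$. Moreover, by the cosine rule,

\[
\|x - \tilde x\|_2^2 = a^2 + b^2 - 2abc,
\]
\[
\|\Theta(x; \lambda) - \Theta(\tilde x; \lambda)\|_2^2 = ((a - \lambda)_+)^2 + ((b - \lambda)_+)^2 - 2(a - \lambda)_+(b - \lambda)_+ c.
\]

(i) Suppose $a, b > \lambda$. Then, $\Delta = -2\lambda^2 + 2(a + b)\lambda + 2\lambda^2 c - 2\lambda(a + b)c =
2(1 - c)\lambda(a + b - \lambda) \ge 0$.

(ii) Suppose $a \le \lambda$ and $b > \lambda$. Then, $\Delta = a^2 + b^2 - 2abc - (b - \lambda)^2 = a^2 - 2abc -
\lambda^2 + 2b\lambda \ge a^2 - 2ab - \lambda^2 + 2b\lambda = (2b - a - \lambda)(\lambda - a) \ge 0$.

Similarly, we can show that $\Delta \ge 0$ for $a > \lambda$ and $b \le \lambda$ and for $a, b \le \lambda$.
Therefore, $\|\Theta(x; \lambda) - \Theta(\tilde x; \lambda)\|_2 \le \|x - \tilde x\|_2$. □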
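
The nonexpansiveness can also be probed numerically. The sketch below assumes that $\Theta(x; \lambda)$ is the multivariate soft-thresholding operator, which keeps the direction of x and reduces its Euclidean norm to $(\|x\|_2 - \lambda)_+$, consistent with the norms used in the proof; the function names are hypothetical.

```python
import math
import random


def group_threshold(x, lam):
    """Multivariate soft-thresholding: shrink the Euclidean norm of x by lam."""
    norm = math.sqrt(sum(v * v for v in x))
    if norm <= lam:
        return [0.0] * len(x)
    return [(1.0 - lam / norm) * v for v in x]


def dist(x, y):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def max_expansion_ratio(dim, lam, trials, seed=0):
    """Largest observed ratio ||T(x) - T(y)|| / ||x - y|| over random pairs."""
    rng = random.Random(seed)
    worst = 0.0
    for _ in range(trials):
        x = [rng.uniform(-3, 3) for _ in range(dim)]
        y = [rng.uniform(-3, 3) for _ in range(dim)]
        d = dist(x, y)
        if d > 0:
            ratio = dist(group_threshold(x, lam), group_threshold(y, lam)) / d
            worst = max(worst, ratio)
    return worst
```

Up to floating-point rounding, the observed ratio should never exceed 1, in line with the lemma.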

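Proof of Theorem 3.1. By Lemma 4.4, the mapping (17) is nonexpansive. We
use Opial's conditions [15, 18] for nonexpansive operators to prove the strict
convergence of $\beta(t)$.

First, the fixed point set of the mapping is non-empty due to the convexity and the
KKT conditions. It remains to show that the mapping is asymptotically regular:
for any starting point $\beta(0)$, $\|\beta(t+1) - \beta(t)\|_2 \to 0$ as $t \to \infty$.
Assume that the scaling operations (16) have been performed beforehand. Let $F(\beta) :=
\|Y - X\beta\|_2 + \sum_{j=1}^{q} \lambda_j\|\beta^j\|_2$ be the objective function. Introduce a surrogate
function

\[
G(\beta, \gamma) := \|Y - X\beta\|_2 + \frac{(\gamma - \beta)'X'(X\beta - Y)}{\|X\beta - Y\|_2} + \frac{\|\gamma - \beta\|_2^2}{2\|X\beta - Y\|_2} + \sum_{j=1}^{q} \lambda_j\|\gamma^j\|_2. \qquad (35)
\]

Given $\beta$, algebraic manipulations show that minimizing G over $\gamma$ is equivalent
to

\[
\min_{\gamma}\; \frac{1}{2\|X\beta - Y\|_2} \left\| \gamma - \left[ \beta + X'Y - X'X\beta \right] \right\|_2^2 + \sum_{j=1}^{q} \lambda_j\|\gamma^j\|_2. \qquad (36)
\]

Applying Lemma 1 and Lemma 2 in [19], we have the optimal $\gamma_0$ given by

\[
\gamma_0^j = \Theta\left( \beta^j + (X^j)'(Y - X\beta);\; \lambda_j\|X\beta - Y\|_2 \right), \quad 1 \le j \le q, \qquad (37)
\]

and further obtain

\[
G(\beta, \gamma_0 + \delta) - G(\beta, \gamma_0) \ge \frac{\|\delta\|_2^2}{2\|X\beta - Y\|_2}.
\]

Therefore (see (17)),

\[
G(\beta(t), \beta(t)) - G(\beta(t), \beta(t+1)) \ge \frac{\|\beta(t+1) - \beta(t)\|_2^2}{2\|X\beta(t) - Y\|_2}. \qquad (38)
\]

On the other hand, a Taylor series expansion yields

\[
\|Y - X\gamma\|_2 = \|Y - X\beta\|_2 + \frac{(\gamma - \beta)'X'(X\beta - Y)}{\|X\beta - Y\|_2}
+ \frac{1}{2}(\beta - \gamma)' \left[ \frac{X'X}{\|X\xi - Y\|_2} - \frac{X'(X\xi - Y)(X\xi - Y)'X}{\|X\xi - Y\|_2^3} \right] (\beta - \gamma)
\]

for some $\xi = \vartheta\beta + (1 - \vartheta)\gamma$ with $\vartheta \in (0, 1)$. This implies

\[
\|Y - X\gamma\|_2 - \|Y - X\beta\|_2 - \frac{(\gamma - \beta)'X'(X\beta - Y)}{\|X\beta - Y\|_2}
\le \frac{1}{2}(\beta - \gamma)'\, \frac{X'X}{\|X\xi - Y\|_2}\, (\beta - \gamma).
\]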
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Using this and (38), we obtain 

Fm + D) + (Pit + 1) - my {^xmrw/ + nxcw-m. ^'^) + " ^^^^ 

< G(/3(i),/3(t + l)) 

< G(/3W,/3W) - i^^^^-i-^(/3(t + 1) - /3(t))'(/3(t + 1) - /3(t)) 

for some ^{t) = i){t)l3{t) + (1 - i}{t))(3{t + 1) with t?(t) e (0, 1). Therefore, with 
11^ 

denoting the operator norm of the matrix X, 

WW) - W(« + D) > + iiWw-Vlb ) + " - '''' 

Under the regularity condition and for K large enough, $F(\beta(t))$ is monotonically
decreasing. We get

and so $\|\beta(t+1) - \beta(t)\|_2 \to 0$ due to the regularity condition. With all of Opial's
conditions satisfied, $\beta(t)$ has a unique limit point $\beta^*$. It is easy to verify that $\beta^*$
as a fixed point of (17) satisfies the KKT conditions (25) and (26). This means
that $\beta^*$ is a global minimizer. □
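
The iteration analysed in this proof can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' code: it takes the update to be the group-wise thresholding step $\beta^j(t+1) = \Theta(\beta^j(t) + (X^j)'(Y - X\beta(t));\; \lambda_j\|X\beta(t) - Y\|_2)$ applied to all groups simultaneously, assumes X and Y have already been rescaled, and assumes the residual stays nonzero. All function names are hypothetical.

```python
import math


def _threshold(z, lam):
    """Multivariate soft-thresholding of the vector z at level lam."""
    norm = math.sqrt(sum(v * v for v in z))
    return [0.0] * len(z) if norm <= lam else [(1.0 - lam / norm) * v for v in z]


def residual(X, Y, beta):
    """Residual vector Y - X beta for X given as a list of rows."""
    return [Y[i] - sum(X[i][k] * beta[k] for k in range(len(beta)))
            for i in range(len(Y))]


def objective(X, Y, beta, groups, lambdas):
    """F(beta) = ||Y - X beta||_2 + sum_j lambda_j ||beta^j||_2."""
    r_norm = math.sqrt(sum(r * r for r in residual(X, Y, beta)))
    pen = sum(l * math.sqrt(sum(beta[k] ** 2 for k in g))
              for g, l in zip(groups, lambdas))
    return r_norm + pen


def gsrl_step(X, Y, beta, groups, lambdas):
    """One simultaneous update of all groups from the current beta."""
    r = residual(X, Y, beta)
    r_norm = math.sqrt(sum(v * v for v in r))
    new_beta = list(beta)
    for g, lam_j in zip(groups, lambdas):
        # z = beta^j + (X^j)'(Y - X beta), thresholded at lam_j * ||residual||.
        z = [beta[k] + sum(X[i][k] * r[i] for i in range(len(Y))) for k in g]
        for k, v in zip(g, _threshold(z, lam_j * r_norm)):
            new_beta[k] = v
    return new_beta


def gsrl_fit(X, Y, groups, lambdas, iters=200):
    """Run a fixed number of iterations starting from beta = 0."""
    beta = [0.0] * len(X[0])
    for _ in range(iters):
        beta = gsrl_step(X, Y, beta, groups, lambdas)
    return beta
```

On the toy problem X = 0.5 I (two observations, two singleton groups), Y = (3, 4) and penalty levels (0.2, 0.5), the first step from zero gives (0.5, 0), and the iterates approach a point whose second coordinate is zero, with a smaller objective value than at the start.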